Introduction

In this session we will use example datasets from single-cell RNA sequencing and spatial transcriptomics to demonstrate additional types of visualizations, including reduced dimension plots, spatial plots, and heatmaps.

We will use R packages from Bioconductor for the data processing and some of the visualizations.

To install packages from Bioconductor, first install the BiocManager package from CRAN with install.packages("BiocManager"), and then use the command BiocManager::install().

Load packages

library(tidyverse)
library(here)
library(ggplot2)
library(SingleCellExperiment)
library(SpatialExperiment)
library(STexampleData)
library(TENxPBMCData)
library(scater)
library(scran)
library(pheatmap)
library(ComplexHeatmap)

Single-cell dataset

Load data

We load an example single-cell dataset from the TENxPBMCData package. This package contains single-cell RNA sequencing data from peripheral blood mononuclear cells (PBMCs) measured with 10x Genomics platforms, and saved in SingleCellExperiment format.

# load dataset
sce <- TENxPBMCData(dataset = "pbmc4k")
## see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
## loading from cache
sce
## class: SingleCellExperiment 
## dim: 33694 4340 
## metadata(0):
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(3): ENSEMBL_ID Symbol_TENx Symbol
## colnames: NULL
## colData names(11): Sample Barcode ... Individual Date_published
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
dim(sce)
## [1] 33694  4340

Workflow

We run some steps from a short analysis workflow on this dataset. We skip several analysis steps here (in particular quality control), since we are mainly interested in using this dataset for demonstration purposes to generate several new types of visualizations.

For more details on the analysis workflow, see the Orchestrating Single-Cell Analysis with Bioconductor online book.

# preprocessing
rownames(sce) <- rowData(sce)$Symbol_TENx

# normalization
set.seed(123)
quickclus <- quickCluster(sce)
table(quickclus)
## quickclus
##   1   2   3   4   5   6   7   8   9  10  11 
## 588 361 264 161 866 365 531 536 360 207 101
sce <- computeSumFactors(sce, cluster = quickclus)
sce <- logNormCounts(sce)

# variance modeling
set.seed(123)
dec <- modelGeneVarByPoisson(sce)
hvgs <- getTopHVGs(dec, prop = 0.1)

# dimensionality reduction
set.seed(123)
sce <- denoisePCA(sce, subset.row = hvgs, technical = dec)
set.seed(123)
sce <- runTSNE(sce, dimred = "PCA")
set.seed(123)
sce <- runUMAP(sce, dimred = "PCA")

# clustering
g <- buildSNNGraph(sce, k = 10, use.dimred = "PCA")
clus <- igraph::cluster_walktrap(g)$membership
table(clus)
## clus
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
## 663 610 249 353 166 322 993 391 165  47 156  95  78  23  29
colLabels(sce) <- factor(clus)

# differential expression
markers <- findMarkers(sce, pval.type = "some", direction = "up")

Reduced dimension plots

Here, we demonstrate several types of reduced dimension plots. These provide alternative ways to visualize structure in high-dimensional data.

To do this, we need to first create a data frame containing all the variables to plot. These are stored in colData(sce) and reducedDim(sce) in the SingleCellExperiment object.

# create data frame
df <- cbind.data.frame(
  colData(sce), 
  reducedDim(sce, "PCA"), 
  reducedDim(sce, "TSNE"), 
  reducedDim(sce, "UMAP")
)

head(df, 3)
##   Sample            Barcode         Sequence Library Cell_ranger_version
## 1 pbmc4k AAACCTGAGAAGGCCT-1 AAACCTGAGAAGGCCT       1                v2.1
## 2 pbmc4k AAACCTGAGACAGACC-1 AAACCTGAGACAGACC       1                v2.1
## 3 pbmc4k AAACCTGAGATAGTCA-1 AAACCTGAGATAGTCA       1                v2.1
##   Tissue_status Barcode_type   Chemistry Sequence_platform    Individual
## 1          <NA>     Chromium Chromium_v2         Hiseq4000 HealthyDonor1
## 2          <NA>     Chromium Chromium_v2         Hiseq4000 HealthyDonor1
## 3          <NA>     Chromium Chromium_v2         Hiseq4000 HealthyDonor1
##   Date_published sizeFactor label      PC1        PC2       PC3       PC4
## 1     2017-11-08  0.4032459     6 17.04326 -0.2718908 -1.202488 -3.967078
## 2     2017-11-08  0.7397772     1 16.31126 -2.1582928 -5.494709 -1.364433
## 3     2017-11-08  0.4294928     6 20.69168  4.9046899  1.752722 -5.893432
##          PC5        PC6       PC7     TSNE1    TSNE2     UMAP1      UMAP2
## 1  1.7250706 -0.6468792 0.6157505 -34.45390 4.498456 -13.82578  0.4271781
## 2 -0.9776328 -2.5657073 0.2732182 -23.05328 6.489842 -12.44451 -1.0040864
## 3 -1.5285852  3.5752710 3.4892835 -40.73745 4.826225 -14.42250  1.3044480

First, we generate a principal component analysis (PCA) plot showing the first two principal components (PCs) using ggplot. We plot clusters by color and add several additional formatting options.

# PCA plot
ggplot(df, aes(x = PC1, y = PC2, color = label)) + 
  geom_point(alpha = 0.5, size = 0.5) + 
  ggtitle("PCA") + 
  theme_bw() + 
  theme(panel.grid = element_blank())

Alternatively, we can generate t-SNE or UMAP plots. The t-SNE and UMAP algorithms are commonly used in the single-cell field to visualize clustering results in high-dimensional data. These are relatively complicated non-linear algorithms, which tend to provide a better separation of clusters in a two-dimensional visualization. However, caution is also needed when interpreting the plots, since they can be misleading by introducing cluster separation where none exists, and by generating an arbitrary relative spatial arrangement of the clusters.

Observe that the clusters are more clearly separated in the UMAP plot than in the t-SNE plot, and that the relative spatial arrangement of the clusters differs between the two plots (and also differs between random seeds when running the algorithms).

# t-SNE plot
ggplot(df, aes(x = TSNE1, y = TSNE2, color = label)) + 
  geom_point(alpha = 0.5, size = 0.5) + 
  ggtitle("t-SNE") + 
  theme_bw() + 
  theme(panel.grid = element_blank())

# UMAP plot
ggplot(df, aes(x = UMAP1, y = UMAP2, color = label)) + 
  geom_point(alpha = 0.5, size = 0.5) + 
  ggtitle("UMAP") + 
  theme_bw() + 
  theme(panel.grid = element_blank())

Heatmaps

Here, we demonstrate heatmaps as a way to visualize the top expressed marker genes per cluster.

First, we obtain the top marker genes for cluster 5 using some additional code continuing from the workflow above.

# top marker genes for cluster 5
top_markers <- markers[["5"]]
best <- top_markers[1:20, ]
lfcs <- getMarkerEffects(best)

Now, create a heatmap using the pheatmap package. This shows marker genes in rows, clusters in columns, and log-fold-change expression as color gradient. Rows and columns are automatically grouped using additional clustering (which may either help the reader or possibly mislead or overcomplicate things, depending on the dataset).

# heatmap using pheatmap
pheatmap(lfcs, breaks = seq(-5, 5, length.out = 101))

An alternative package to create heatmaps is ComplexHeatmap, which provides extensive and powerful options for customized formatting, and includes detailed documentation and tutorials.

Here, we re-create the above heatmap using ComplexHeatmap, along with additional formatting options.

# heatmap using ComplexHeatmap
Heatmap(lfcs)

# ComplexHeatmap with additional formatting
Heatmap(
  lfcs, 
  name = "log2-fold change", 
  column_title = "cluster", 
  column_title_side = "bottom", 
  row_title = "gene", 
  row_names_gp = gpar(fontface = "italic"), 
  rect_gp = gpar(col = "white", lwd = 1.5)
)

Violin plots

Alternatively, we can also use violin plots to visualize the top marker genes per cluster. Below, we show an example using a plotting function from the scater package.

This shows the same information as the heatmap above, represented using a different type of visualization. (Which is more effective for this dataset?)

# violin plots
plotExpression(sce, features = rownames(best), x = "label", color_by = "label")

Spatial transcriptomics dataset

Load data

In this example, we use a spatial transcriptomics dataset from the STexampleData package. This is a spatial transcriptomics dataset from the dorsolateral prefrontal cortex (DLPFC) region in a postmortem human brain sample, measured with the 10x Genomics Visium platform, and saved in SpatialExperiment format.

# load dataset
spe <- Visium_humanDLPFC()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## loading from cache
spe
## class: SpatialExperiment 
## dim: 33538 4992 
## metadata(0):
## assays(1): counts
## rownames(33538): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(3): gene_id gene_name feature_type
## colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ...
##   TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1
## colData names(7): barcode_id sample_id ... ground_truth cell_count
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## spatialCoords names(2) : pxl_col_in_fullres pxl_row_in_fullres
## imgData names(4): sample_id image_id data scaleFactor
dim(spe)
## [1] 33538  4992

Workflow

We run some steps from a short analysis workflow to prepare the dataset. As above, we skip several analysis steps here, since we are using the dataset for demonstration purposes to generate visualizations only.

For more details on the analysis workflow, see the Best Practices for Spatial Transcriptomics Analysis with Bioconductor online book.

The ggspavis package also provides plotting functions for spatial transcriptomics data. Here, we use ggplot code instead, to show how the plots are built up.

# preprocessing
spe <- spe[, colData(spe)$in_tissue == 1]
colnames(spatialCoords(spe)) <- c("x", "y")
colData(spe)$label <- colData(spe)$ground_truth

# normalization
spe <- logNormCounts(spe)

# feature selection
is_mito <- grepl("(^MT-)|(^mt-)", rowData(spe)$gene_name)
table(is_mito)
## is_mito
## FALSE  TRUE 
## 33525    13
spe <- spe[!is_mito, ]
dec <- modelGeneVar(spe)
top_hvgs <- getTopHVGs(dec, prop = 0.1)

# dimensionality reduction
set.seed(123)
spe <- runPCA(spe, subset_row = top_hvgs)
set.seed(123)
spe <- runUMAP(spe, dimred = "PCA")
colnames(reducedDim(spe, "UMAP")) <- paste0("UMAP", 1:2)

Spatial plots

For spatial data, we can generate a spatial plot showing annotation labels or expression intensity in x-y coordinates.

First, we create a data frame containing the data for each spot (spatial measurement location), stored in colData(spe), spatialCoords(spe), and reducedDim(spe) in the SpatialExperiment object.

# set up data frame
df <- cbind.data.frame(
  colData(spe), 
  spatialCoords(spe), 
  reducedDim(spe, "PCA"), 
  reducedDim(spe, "UMAP")
)

head(df, 2)
##                            barcode_id     sample_id in_tissue array_row
## AAACAAGTATCTCCCA-1 AAACAAGTATCTCCCA-1 sample_151673         1        50
## AAACAATCTACTAGCA-1 AAACAATCTACTAGCA-1 sample_151673         1         3
##                    array_col ground_truth cell_count  label sizeFactor    x
## AAACAAGTATCTCCCA-1       102       Layer3          6 Layer3  1.8428941 9791
## AAACAATCTACTAGCA-1        43       Layer1         16 Layer1  0.3632188 5769
##                       y      PC1       PC2        PC3        PC4        PC5
## AAACAAGTATCTCCCA-1 8468 4.854207  1.065363 -0.4216272  1.7875615 -0.2024489
## AAACAATCTACTAGCA-1 2807 1.295053 -4.155685  1.3642721 -0.1832911 -1.4907549
##                           PC6       PC7        PC8        PC9       PC10
## AAACAAGTATCTCCCA-1 -0.7078520 -1.019253 -0.4320465 -1.2705471  1.0381851
## AAACAATCTACTAGCA-1 -0.4808741 -2.390182 -2.6052694  0.7815192 -0.8981974
##                         PC11      PC12       PC13      PC14       PC15
## AAACAAGTATCTCCCA-1 0.3112108 -1.010108 -0.8010832 0.4172931 -0.5170250
## AAACAATCTACTAGCA-1 2.6791549 -1.956611  0.2053435 1.6410917  0.6911729
##                          PC16      PC17       PC18       PC19       PC20
## AAACAAGTATCTCCCA-1 -0.2637965 -0.724249  0.2981517 -0.6003563  0.0831525
## AAACAATCTACTAGCA-1 -0.9726718  5.115035 -1.3021309  1.4019829 -0.3443739
##                         PC21         PC22      PC23       PC24        PC25
## AAACAAGTATCTCCCA-1 -0.632314 -0.006210057 0.3180925  0.6784704 -0.31130902
## AAACAATCTACTAGCA-1  1.192591 -1.471208681 0.1300177 -2.1604481 -0.09396313
##                           PC26       PC27       PC28      PC29       PC30
## AAACAAGTATCTCCCA-1 -0.04425171 -0.3914127  1.0541277 -1.210250  0.1229272
## AAACAATCTACTAGCA-1  0.14986324  0.3577853 -0.1819574 -1.830997 -0.1061296
##                         PC31       PC32       PC33       PC34       PC35
## AAACAAGTATCTCCCA-1 0.6136811 0.06691078 -0.1427879  0.2381235 -0.1877092
## AAACAATCTACTAGCA-1 0.2324755 1.14920270 -1.1348378 -0.5429243  0.7826222
##                          PC36      PC37        PC38       PC39       PC40
## AAACAAGTATCTCCCA-1  0.2447371 -1.145837 -0.51591009 -0.4117514 -1.0601292
## AAACAATCTACTAGCA-1 -2.3139240  2.190438 -0.04699769  0.9157029 -0.7106704
##                          PC41        PC42       PC43       PC44       PC45
## AAACAAGTATCTCCCA-1 -0.7755413  0.51368093  0.2211477  0.2450753 0.06177133
## AAACAATCTACTAGCA-1  0.9319064 -0.04174465 -0.5759469 -3.6146601 0.42623415
##                         PC46      PC47       PC48       PC49     PC50    UMAP1
## AAACAAGTATCTCCCA-1 0.5209681 0.7982671 -0.0383644 -0.0163750 0.466978 1.622433
## AAACAATCTACTAGCA-1 3.2290276 3.9403996 -0.5511797 -0.8657158 2.520766 1.930111
##                         UMAP2
## AAACAAGTATCTCCCA-1 -1.7603560
## AAACAATCTACTAGCA-1 -0.8818074

Now, we can create a spatial plot showing manually annotated reference labels in x-y coordinates. We also include several plot formatting options.

# color palette
pal <- c("#F0027F", "#377EB8", "#4DAF4A", "#984EA3", 
         "#FFD700", "#FF7F00", "#1A1A1A", "#666666")

# spatial plot showing annotation labels
ggplot(df, aes(x = x, y = y, color = label)) + 
  geom_point(size = 0.6) + 
  scale_color_manual(values = pal) + 
  coord_fixed() + 
  scale_y_reverse() + 
  ggtitle("Manually annotated") + 
  theme_bw() + 
  theme(panel.grid = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank()) + 
  guides(color = guide_legend(override.aes = list(size = 2)))

Alternatively, create a plot showing expression intensity of a specific gene of interest in x-y coordinates.

# set up data frame
my_gene <- "MOBP"
ix <- which(rowData(spe)$gene_name == my_gene)
df <- cbind(df, gene = counts(spe)[ix, ])
# spatial plot showing gene expression
ggplot(df, aes(x = x, y = y, color = gene)) + 
  geom_point(size = 0.6) + 
  scale_color_gradient(low = "gray85", high = "blue") + 
  coord_fixed() + 
  scale_y_reverse() + 
  ggtitle(my_gene) + 
  theme_bw() + 
  theme(panel.grid = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())

Reduced dimension plots

We can also generate reduced dimension plots for this dataset to further investigate the annotation labels or a gene of interest.

# PCA plot showing annotation labels
ggplot(df, aes(x = PC1, y = PC2, color = label)) + 
  geom_point(alpha = 0.5, size = 0.5) + 
  ggtitle("PCA") + 
  theme_bw() + 
  theme(panel.grid = element_blank()) + 
  guides(color = guide_legend(override.aes = list(size = 2)))

# UMAP plot showing annotation labels
ggplot(df, aes(x = UMAP1, y = UMAP2, color = label)) + 
  geom_point(alpha = 0.5, size = 0.5) + 
  ggtitle("UMAP") + 
  theme_bw() + 
  theme(panel.grid = element_blank()) + 
  guides(color = guide_legend(override.aes = list(size = 2)))

# UMAP plot showing gene expression
ggplot(df, aes(x = UMAP1, y = UMAP2, color = gene)) + 
  geom_point(alpha = 0.5, size = 0.5) + 
  scale_color_gradient(low = "gray85", high = "blue") + 
  ggtitle(my_gene) + 
  theme_bw() + 
  theme(panel.grid = element_blank())